Anomaly detection in search queries refers to the process of identifying unusual or unexpected patterns in search query data.
Search queries represent the queries made by users in a search engine or a database to retrieve specific information. Anomalies in search queries can indicate various issues such as technical glitches, spam attacks, changes in user behavior, or emerging trends.
To perform anomaly detection in search queries, we can follow this process:
Data Preprocessing:
Exploratory Data Analysis (EDA):
Model Selection:
Anomaly Detection:
Evaluation:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from collections import Counter
import re #regular Expression
import plotly.express as px # creating a figure,plot,chart
import plotly.io as pio
pio.templates.default = "plotly_white"
import matplotlib.pyplot as plt #for various visualizations mainly focused on 2D and 1D data representations.
from wordcloud import WordCloud #create word frequency visual representations based on your data.
from sklearn.ensemble import IsolationForest # Isolation Forest used for anomaly or outlier detections in the data
df_SearchQueries = pd.read_csv('queries.csv')
df_SearchQueries.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Top queries  1000 non-null   object
 1   Clicks       1000 non-null   int64
 2   Impressions  1000 non-null   int64
 3   CTR          1000 non-null   object
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB
df_SearchQueries.head(5) #Top 5 Search Queries
| | Top queries | Clicks | Impressions | CTR | Position |
|---|---|---|---|---|---|
| 0 | lotus healthcare | 114 | 2028 | 5.62% | 4.26 |
| 1 | groom facial package | 21 | 1495 | 1.4% | 13.37 |
| 2 | lotus health care | 20 | 376 | 5.32% | 4.86 |
| 3 | bride groom facial package | 16 | 159 | 10.06% | 3.52 |
| 4 | best diabetes doctor in pune | 14 | 1213 | 1.15% | 10.57 |
df_SearchQueries.tail(5) #Bottom 5 Search Queries
| | Top queries | Clicks | Impressions | CTR | Position |
|---|---|---|---|---|---|
| 995 | facial skin peeling | 0 | 9 | 0% | 8.56 |
| 996 | lotus health clinic & diagnostics | 0 | 9 | 0% | 9.33 |
| 997 | best sugar doctor near me | 0 | 9 | 0% | 10.89 |
| 998 | anti aging clinic | 0 | 9 | 0% | 12.67 |
| 999 | diabetes reversal clinic near me | 0 | 9 | 0% | 13.11 |
df_SearchQueries.isna().sum() #Checking for null values
Top queries    0
Clicks         0
Impressions    0
CTR            0
Position       0
dtype: int64
# Filter the DataFrame to select Top queries whose 'Position' is 1.00
filtered_queries = df_SearchQueries[df_SearchQueries['Position'] == 1.00]
count_of_filtered_queries = len(filtered_queries)
# Print the count, then list each matching query
print(f"\033[1mTotal Top queries with Position 1.00: = {count_of_filtered_queries}\033[0m")
for query in filtered_queries['Top queries']:
    print(query)
Total Top queries with Position 1.00: = 8
local guide program
diabetologist near me
aesthetic spa
ayurvedic treatment
laser treatment for dark circles
laser treatment for facial hair
hair specialist near me
peeling skin
# This function cleans and splits a search query into individual words, removing stop words and preparing them for
# further analysis.
def clean_and_split(query):
    # Define words to exclude (a set of stop words)
    exclude_words = {"in", "for", "near", "me", "and", "pune"}
    # Use regex to find words (converted to lowercase for case-insensitivity)
    words = re.findall(r'\b[a-zA-Z]+\b', query.lower())
    # Filter out excluded words
    words = [word for word in words if word not in exclude_words]
    return words
# Split each query into words and count the frequency of each word
word_counts = Counter()
for query in df_SearchQueries['Top queries']:
    word_counts.update(clean_and_split(query))
word_freq_df = pd.DataFrame(word_counts.most_common(25),
                            columns=['Word', 'Frequency'])
# Plotting the word frequencies
fig = px.bar(word_freq_df, x='Word', y='Frequency', title='Top 25 Most Common Words in Search Queries')
fig.show()
def wordCloud_generator(queries, title=None):
    # 'queries' is a pandas Series of query strings; join them into one text for the word cloud
    wordcloud = WordCloud(width=800, height=800,
                          background_color='#0000',
                          min_font_size=10
                          ).generate(" ".join(queries.values))
    # Plot the WordCloud image
    plt.figure(figsize=(12, 12), facecolor='White')
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(title, fontsize=30)
    plt.show()
wordCloud_generator(df_SearchQueries['Top queries'], title="Word Cloud")
print(word_freq_df)
         Word  Frequency
0   treatment        268
1        skin        189
2    diabetes        115
3       lotus        104
4        best         89
5        hair         85
6      clinic         84
7       laser         82
8        care         66
9      doctor         51
10    removal         46
11     bridal         44
12   holistic         42
13   reversal         40
14  polishing         40
15  ayurvedic         37
16    package         35
17     health         35
18   packages         34
19  aesthetic         34
20    program         33
21 specialist         33
22      groom         32
23       anti         29
24       loss         28
# Top queries by Clicks and Impressions
top_queries_clicks = df_SearchQueries.nlargest(10, 'Clicks')[['Top queries', 'Clicks']]
top_queries_impressions = df_SearchQueries.nlargest(10, 'Impressions')[['Top queries', 'Impressions']]
# Plotting
fig_clicks = px.bar(top_queries_clicks, x='Top queries', y='Clicks', title='Top Queries by Clicks')
fig_impressions = px.bar(top_queries_impressions, x='Top queries', y='Impressions', title='Top Queries by Impressions')
fig_clicks.show()
fig_impressions.show()
# Convert CTR from a percentage string (e.g. "5.62%") to a fraction (e.g. 0.0562)
if df_SearchQueries['CTR'].dtype != object:
    # Convert to string if necessary so the string methods below can be applied
    df_SearchQueries['CTR'] = df_SearchQueries['CTR'].astype(str)
# Strip the trailing '%' and convert to float; missing or unparseable values become NaN
df_SearchQueries['CTR'] = pd.to_numeric(df_SearchQueries['CTR'].str.rstrip('%'), errors='coerce') / 100
# Queries with highest and lowest CTR
top_ctr = df_SearchQueries[df_SearchQueries['CTR'] > 0]
top_ctr_filtered = top_ctr.nlargest(10, 'CTR')[['Top queries', 'CTR']]
bottom_ctr = df_SearchQueries[df_SearchQueries['CTR'] > 0]
bottom_ctr_filtered = bottom_ctr.nsmallest(10, 'CTR')[['Top queries', 'CTR']]
# Plotting
fig_top_ctr = px.bar(top_ctr_filtered, x='Top queries', y='CTR', title='Top Queries by CTR')
fig_bottom_ctr = px.bar(bottom_ctr_filtered, x='Top queries', y='CTR', title='Bottom Queries by CTR')
fig_top_ctr.show()
fig_bottom_ctr.show()
The SI prefix "micro-" (μ) denotes a factor of $10^{-6}$. So, for example, 200μ is equivalent to $200 \times 10^{-6}$, or 0.0002.
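The conversion can be checked with a few lines of Python. This is a minimal sketch, assuming the μ shows up as a suffix on an axis tick label; `SI_DIVISORS` and `expand_si` are hypothetical names introduced here for illustration and are not part of Plotly or pandas.

```python
# Hypothetical helper: expand an SI-suffixed label such as "200μ" into a plain float.
# Divisors are used instead of multipliers to avoid floating-point noise in the result.
SI_DIVISORS = {
    "m": 1e3,   # milli = 10^-3
    "μ": 1e6,   # micro = 10^-6 (Greek small letter mu)
    "µ": 1e6,   # micro = 10^-6 (micro sign, a different code point)
    "n": 1e9,   # nano  = 10^-9
}

def expand_si(label: str) -> float:
    """Return the numeric value of a label like '200μ'; plain numbers pass through."""
    label = label.strip()
    if label and label[-1] in SI_DIVISORS:
        return float(label[:-1]) / SI_DIVISORS[label[-1]]
    return float(label)

print(expand_si("200μ"))  # 0.0002
```

A CTR of 0.0002 corresponds to 0.02%, which is the scale of the smallest non-zero CTR values in a dataset like this one.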
# Correlation matrix visualization
correlation_matrix = df_SearchQueries[['Clicks', 'Impressions', 'CTR', 'Position']].corr()
fig_corr = px.imshow(correlation_matrix, title='Correlation Matrix' , color_continuous_scale='Viridis')
# Customize text displayed on the heatmap
fig_corr.update_traces(text=correlation_matrix.values.round(2))
# Add text annotations
annotations = []
for i, row in enumerate(correlation_matrix.index):
    for j, value in enumerate(correlation_matrix.columns):
        annotations.append(dict(
            text=str(correlation_matrix.iloc[i, j].round(2)),
            x=j,
            y=i,
            xref='x',
            yref='y',
            showarrow=False,
            font=dict(color='white')
        ))
fig_corr.update_layout(annotations=annotations)
fig_corr.show()
In the correlation matrix, we observe the following relationships between key metrics:
These interpretations provide valuable insights into the relationships between the variables in your dataset, helping you understand how changes in one variable may affect another.
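One way to read the relationships off a correlation matrix is to list each pair of metrics once, ordered by the strength of the correlation. The sketch below uses only the five rows shown in the `head(5)` output above (with CTR already converted to a fraction), so its coefficients are illustrative and will differ from those of the full 1000-row dataset; the variable names `sample` and `pairs` are assumptions made for this example.

```python
import pandas as pd

# The first five rows of the dataset, copied from the head(5) output (CTR as a fraction)
sample = pd.DataFrame({
    "Clicks": [114, 21, 20, 16, 14],
    "Impressions": [2028, 1495, 376, 159, 1213],
    "CTR": [0.0562, 0.0140, 0.0532, 0.1006, 0.0115],
    "Position": [4.26, 13.37, 4.86, 3.52, 10.57],
})

corr = sample.corr()  # Pearson correlation by default

# Collect each unordered pair of columns once (upper triangle of the matrix)
cols = list(corr.columns)
pairs = []
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        pairs.append((a, b, float(corr.loc[a, b])))

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda p: abs(p[2]), reverse=True)
for a, b, r in pairs:
    print(f"{a:>12} vs {b:<12} r = {r:+.2f}")
```

A positive `r` means the two metrics tend to rise together; a negative `r` means one tends to fall as the other rises. Applying the same loop to `correlation_matrix` from the cell above gives the values for the full dataset.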
Now, let’s detect anomalies in search queries. You can use various techniques for anomaly detection. A simple and effective method is the Isolation Forest algorithm, which works well with different data distributions and is efficient with large datasets:
# Selecting relevant features
features = df_SearchQueries[['Clicks', 'Impressions', 'CTR', 'Position']]
# Initializing Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination=0.01) # contamination is the expected proportion of outliers
# Fitting the model
iso_forest.fit(features)
# Predicting anomalies
df_SearchQueries['anomaly'] = iso_forest.predict(features)
# Filtering out the anomalies
anomalies = df_SearchQueries[df_SearchQueries['anomaly'] == -1]
print(anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']])
                    Top queries  Clicks  Impressions     CTR  Position
0              lotus healthcare     114         2028  0.0562      4.26
1          groom facial package      21         1495  0.0140     13.37
2             lotus health care      20          376  0.0532      4.86
3    bride groom facial package      16          159  0.1006      3.52
6                  lotus clinic      11         2526  0.0044     20.26
7     diabetes reversal program      10         6281  0.0016     42.95
21              lotushealthcare       6           15  0.4000      3.53
171               aesthetic spa       1            1  1.0000      1.00
172         ayurvedic treatment       1            1  1.0000      1.00
173          diabetologist pune       1            1  1.0000      3.00
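`predict` only returns a label (-1 for an anomaly, 1 otherwise). To see how strongly each row is flagged, scikit-learn's `IsolationForest` also provides `decision_function`, where lower scores are more anomalous and negative scores correspond to the rows labelled -1. The sketch below runs on synthetic data, not the queries dataset, with one extreme row appended; the data-generating choices and the fixed `random_state` are assumptions added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for query metrics (NOT the article's dataset)
rng = np.random.default_rng(0)
normal = pd.DataFrame({
    "Clicks": rng.poisson(2, 200),
    "Impressions": rng.poisson(50, 200),
    "Position": rng.normal(10, 2, 200),
})
# One deliberately extreme row, appended at index 200
outlier = pd.DataFrame({"Clicks": [114], "Impressions": [2028], "Position": [4.26]})
data = pd.concat([normal, outlier], ignore_index=True)
data["CTR"] = data["Clicks"] / data["Impressions"]

features = data[["Clicks", "Impressions", "CTR", "Position"]]

# random_state is fixed here so repeated runs flag the same rows
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
model.fit(features)

data["score"] = model.decision_function(features)  # lower = more anomalous
data["anomaly"] = model.predict(features)          # -1 = anomaly, 1 = normal

# Most anomalous rows first
print(data.sort_values("score").head(5))
```

Sorting by score makes it easier to separate clear outliers from rows that only just cross the threshold set by `contamination`. Without a fixed `random_state`, the set of flagged queries can change slightly between runs.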
Some key observations and potential anomalies in search queries are as follows:
Low Click-Through Rates (CTRs):
• lotus healthcare: This query receives a high number of impressions (2028) but a low CTR (0.0562). This indicates potential issues with the landing page content or ad targeting, as many users see the ad but few click through.
• groom facial package: Similar to the above, this query has a high impression count (1495) but a low CTR (0.0140). Investigate the landing page and targeting for this query to improve its effectiveness.
• lotus clinic: This query has an even lower CTR (0.0044) despite a high number of impressions (2526). Its average position of 20.26 may partly explain this, but further analysis is needed to understand the cause.
High CTRs:
• bride groom facial package: This query has a relatively high CTR (0.1006) compared to others. Understanding what resonates with users here could inform broader content and targeting strategies.
• aesthetic spa, ayurvedic treatment, diabetologist pune: These queries have perfect CTRs (1.0000), each from one click on a single impression. They might be specific, high-intent searches, but one data point per query is too little to confirm this.
Other Points:
• lotus healthcare variations: Multiple spellings of "lotus healthcare" appear among the anomalies ("lotus healthcare", "lotus health care", "lotushealthcare").
• diabetes reversal program: This query has a very low CTR (0.0016) despite a very high number of impressions (6281). With an average position of 42.95, low visibility may be a factor alongside possible content issues. Analyze user behavior and optimize accordingly.
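Because the brand query is split across several spellings, it can help to combine them before judging performance. The following is a rough sketch, assuming the variants differ only in spacing; the `query_key` column is a hypothetical name introduced here, and the three rows are copied from the anomaly output above.

```python
import pandas as pd

# The three spellings flagged as anomalies, with their clicks and impressions
variants = pd.DataFrame({
    "Top queries": ["lotus healthcare", "lotus health care", "lotushealthcare"],
    "Clicks": [114, 20, 6],
    "Impressions": [2028, 376, 15],
})

# Assumption: variants differ only in whitespace, so removing it gives a shared key
variants["query_key"] = variants["Top queries"].str.replace(r"\s+", "", regex=True)

# Sum the raw counts per key, then recompute CTR from the totals
combined = variants.groupby("query_key", as_index=False)[["Clicks", "Impressions"]].sum()
combined["CTR"] = combined["Clicks"] / combined["Impressions"]

print(combined)
```

CTR is recomputed from the summed clicks and impressions rather than averaged across rows, since averaging would give the 15-impression variant the same weight as the 2028-impression one.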
Search queries are goldmines of information, but not all queries perform the same. Search query anomaly detection unlocks this potential by identifying outliers in performance metrics like clicks, impressions, and click-through rates (CTRs).
This helps businesses:
• Catch issues early: Identify queries with unexpectedly low CTRs (potentially underperforming content) or high impressions but low clicks (indicating targeting problems).
• Discover opportunities: Spot queries with surprisingly high CTRs (potential content hits) or growing impressions (trending topics).
• Optimize content and advertising: Use these insights to improve content relevance, ad targeting, and overall search performance.
Overall, anomaly detection in search queries provides valuable insights to identify underperforming areas, uncover hidden opportunities, and optimize your search strategy for better results.